Keyword [Spatiotemporal Attention]
Li S, Bak S, Carr P, et al. Diversity regularized spatiotemporal attention for video-based person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 369-378.
1. Overview
1.1. Motivation
- most existing methods encode each video frame in its entirety and compute an aggregate representation across all frames
- when a person is partially occluded, the remaining visible portions may still provide strong cues for re-identification
- features generated directly from entire images can easily miss such fine-grained visual cues
This paper proposes a spatiotemporal attention model:
- multiple spatial attention models (for alignment) + a diversity regularization term (Hellinger distance) so that different models do not attend to the same body region
- aligns corresponding image patches across frames
- determines whether a particular part of the body is occluded in a given frame
- temporal attention
- automatically discovers a diverse set of distinctive body parts
- extracts useful information from all frames without succumbing to occlusions and misalignment
1.2. Related Work
1.2.1. Image-Based Person Re-id
- extracting discriminative features
- learning robust metrics
- Online Instance Matching Loss
1.2.2. Video-Based Person Re-id
(extension of image-based)
- top-push distance
- RNN
- space-time
1.2.3. Attention Models for Person Re-id
- avoid different attention models focusing on the same region
2. Methods
2.1. Restricted Random Sampling
- divide the video into N chunks of equal duration
- randomly sample one image from each chunk
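A minimal sketch of this sampling scheme; the function name and chunk-boundary arithmetic are my own, the paper only specifies equal-duration chunks with one random frame each:

```python
import random

def restricted_random_sampling(num_frames, n_chunks=6, rng=random):
    """Divide frame indices [0, num_frames) into n_chunks chunks of
    (near-)equal duration and sample one random index from each chunk."""
    assert num_frames >= n_chunks, "need at least one frame per chunk"
    # Chunk i covers indices [bounds[i], bounds[i+1]).
    bounds = [i * num_frames // n_chunks for i in range(n_chunks + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(n_chunks)]
```

Because one frame is drawn per chunk, every training epoch sees a different N-frame subset while still covering the whole clip.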
2.2. Multiple Spatial Attention Models
Each spatial attention model focuses on one region: a body part, hat, bag, …
- ResNet-50 backbone (1 conv layer + 4 residual blocks), producing an 8x4 grid of features
- L = 32 (number of grid cells)
- D = 2048 (feature dimension)
- attention weight for the n-th frame, k-th attention model, l-th grid cell
- region feature = attention-weighted sum of the grid features
- enhancement details in the appendix
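A sketch of one forward pass of the spatial attention step, assuming a simple linear scoring function per attention model (the paper uses small learned layers; the parameterization here is a placeholder):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def spatial_attention(feat, W_list):
    """feat: (L, D) grid features of one frame.
    W_list: K scoring vectors of shape (D,) -- hypothetical parameters,
    one per attention model.
    Returns (K, D) region features and (K, L) attention weights."""
    regions, attn = [], []
    for w in W_list:
        s = softmax(feat @ w)        # (L,) distribution over grid cells
        regions.append(s @ feat)     # (D,) weighted sum of grid features
        attn.append(s)
    return np.stack(regions), np.stack(attn)
```

Each attention model produces a distribution over the L grid cells, so its region feature is a convex combination of grid features rather than a hard crop.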
2.3. Diversity Regularization
- the attention weights of the n-th frame are collected into a K x L matrix (one row per attention model)
- K: number of attention models
- L: number of grid cells
Hellinger distance: maximize the distance between each pair of attention distributions
Regularization term: multiplied by a coefficient and added to the original OIM loss
- variant: the pairwise distances can be aggregated compactly as a Frobenius-norm penalty on the (element-wise square-rooted) attention matrix
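For discrete distributions p and q, the Hellinger distance is H(p, q) = (1/√2)·‖√p − √q‖₂, so maximizing it pushes attention models toward disjoint regions. A sketch of the distance and of a Frobenius-norm penalty that aggregates the pairwise terms (treat the exact penalty form as illustrative):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def diversity_penalty(A):
    """A: (K, L) matrix whose rows are attention distributions (rows sum to 1).
    Pushing sqrt-rows toward orthonormality is equivalent to pushing the
    pairwise Hellinger distances toward their maximum."""
    R = np.sqrt(A)                            # element-wise square root
    K = A.shape[0]
    return np.linalg.norm(R @ R.T - np.eye(K)) ** 2
```

The penalty is 0 exactly when the K attention models occupy disjoint grid cells, and grows as their distributions overlap.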
2.4. Temporal Attention
Pooling features across time with a single per-frame weight is not sufficiently robust: a frame that is partly occluded may still contain valuable partial information about the individual, yet a per-frame weight applies the same temporal attention to every region of that frame.
- instead, a separate set of temporal weights across all frames is learned for each attention region
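A sketch of per-region temporal pooling, assuming unnormalized temporal scores are already computed per frame and per region (how those scores are produced is omitted here):

```python
import numpy as np

def temporal_attention(region_feats, scores):
    """region_feats: (N, K, D) per-frame features for K spatial regions.
    scores: (N, K) unnormalized temporal scores, one per frame AND region.
    Softmax is taken over the frame axis separately for each region,
    then each region is pooled across frames with its own weights."""
    w = np.exp(scores - scores.max(axis=0))
    w = w / w.sum(axis=0)                          # (N, K), columns sum to 1
    return np.einsum('nk,nkd->kd', w, region_feats)  # (K, D)
```

Because each region has its own weight column, a frame where the legs are occluded can still contribute its (visible) upper-body region with high weight.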
2.5. Overview
- the entire video is represented by a single vector x obtained by concatenating the K pooled region features (dimension K x D)
2.6. Re-id Loss
- OIM
3. Experiments
3.1. Details
- N = 6
- pretrain ResNet-50 on image-based re-identification datasets
- with the CNN fixed, train the multiple spatial attention models (with diversity regularization)
- then jointly train the whole network
- SGD, learning rate 0.1, dropped to 0.01
- final embedding: 128-dimensional, L2-normalized
3.2. Ablation Study
3.2.1. Varying the number of spatial attention models
- interestingly, treating a person as a single region performs better than splitting into two distinct body parts